Summary

This project builds a classification model using logistic regression that analyzes various features of diabetic patients to predict whether or not a patient will be readmitted to the hospital.

Introduction

For this project, we are trying to answer the predictive question: given a diabetic patient’s demographics, medication history, and the management of their diabetes during a hospital stay, can we predict whether they will be readmitted to the hospital?

Due to the Covid-19 pandemic, it is critical to reduce the burden on the healthcare system and to keep readmission rates from rising, freeing capacity for Covid-19 cases. Our predictor examines diabetes management and diagnosis during a patient’s hospital stay to understand how much these affect readmission. Analysis with machine learning models will identify the features most predictive of patient readmission. This will allow patient safety protocols to be created and improved so that diabetic patients receive effective care during their hospital stay and readmission is prevented during this critical time.

Methods

The R programming language (R Core Team 2020) and Python programming language (Van Rossum and Drake 2009) were used to perform the analysis. The following R and Python packages were also used to perform the analysis:

For statistical analysis (SCRIPT4) specifically:

The code used to perform the analysis and create this report can be found here.

(“Insulin, Medicines, & Other Diabetes Treatments” 2016) (2019)

Data

The data were submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058 and a recipient of the CERNER data. The dataset was collected from 1998 to 2008 across 130 hospitals and integrated delivery networks throughout the United States of America.

This data set was sourced from the UCI Machine Learning Repository (Strack et al. 2014a) and can be found here. Research from this collected data was used to assess diabetic care during hospitalization and determine whether patients were likely to be readmitted. The paper by Strack et al. (2014b) can be found here. Each row corresponds to a unique encounter with a diabetic patient, extracted from a database of 74,036,643 unique encounters. Details about each column feature collected during these unique encounters can be found here.

Analysis

After data cleaning, a logistic regression model was tested against an RBF SVM and a baseline dummy classifier. Logistic regression was determined to be the best model in terms of fit and score time, accuracy, and F1 score. Continuing with logistic regression, hyperparameters were optimized, and the model was then used to predict diabetic patient readmission (found in the readmitted target column of the data set). The code used to perform the analysis and create this report can be found [here](INSERT_URL_TO_SCRIPT_4).
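The model-comparison step described above can be sketched as follows. This is a minimal, runnable illustration, not the project's actual pipeline: the data here are a synthetic stand-in from `make_classification`, and the model settings (e.g. `strategy="prior"` for the dummy classifier) are assumptions.

```python
# Sketch of comparing a dummy baseline, RBF SVM, and logistic regression
# (plain and class-weight balanced) with cross-validation, as described above.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned diabetes data.
X, y = make_classification(n_samples=500, n_features=10, random_state=123)

models = {
    "dummy": DummyClassifier(strategy="prior"),
    "rbf_svm": SVC(kernel="rbf"),
    "logreg": LogisticRegression(max_iter=1000),
    "logreg_balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
    results[name] = {
        "fit_time": scores["fit_time"].mean(),      # mean over the 5 folds
        "accuracy": scores["test_accuracy"].mean(),
        "f1": scores["test_f1"].mean(),
    }
```

Comparing the mean cross-validated scores per model is what drives the choice of logistic regression over the baseline and the SVM.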

Results and Discussion

Through exploratory data analysis, we determined that some of the features were not informative for answering our question or contained many missing values. This was confirmed through Pandas Profiling, which can be found here, and which also revealed correlations between specific features and a potential class imbalance in the target readmitted column. The correlations between certain numeric features shown in Pandas Profiling were confirmed when we analyzed interactions between the features.

Figure 1. Interactions and Correlation of Numeric Features


Data cleaning was done to address non-informative features, class imbalance, NaN values, and duplicate encounters, and this code can be found here. In particular, the distribution of race was reviewed to determine that there was no significant difference between races in patient readmission, and the race column was therefore removed from our analysis.
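The cleaning steps above can be sketched with pandas. The column names follow the UCI diabetes dataset (which encodes missing values as `"?"`), but the toy frame and the exact sequence of operations here are illustrative assumptions, not the project's actual cleaning script.

```python
# Hedged sketch of the cleaning described above: replace "?" placeholders with
# NaN, drop the uninformative race column, drop duplicate patient encounters,
# and drop rows missing the readmitted target.
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "patient_nbr": [1, 1, 2, 3],
    "race": ["Caucasian", "Caucasian", "?", "AfricanAmerican"],
    "time_in_hospital": [3, 3, 5, 2],
    "readmitted": ["NO", "NO", "<30", np.nan],
})

cleaned = (
    raw.replace("?", np.nan)                     # raw file marks missing values as "?"
       .drop(columns=["race"])                   # race showed no difference in readmission
       .drop_duplicates(subset="patient_nbr")    # keep one encounter per patient
       .dropna(subset=["readmitted"])            # rows without a target are unusable
)
```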

Figure 2. Distribution of Race and Readmission Status


We reviewed several of the numerical features by plotting each feature’s distribution for the two target classes against each other, and found the patient’s primary diagnosis, days spent in hospital, number of medications taken, and frequency of lab procedures to be most informative for predicting readmission status. The primary diagnosis feature may matter because, if the primary diagnosis is diabetes, patients perhaps received better care during their first encounter and thus were not readmitted (Strack et al. 2014b).
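This per-class distribution check can be sketched as overlaid histograms of a numeric feature, split by readmission status. The data frame here is synthetic (the real report used the cleaned dataset), and the plot is one assumed way to render the comparison.

```python
# Sketch: overlay the distribution of a numeric feature for the two
# readmission classes, as described above. Synthetic data for illustration.
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time_in_hospital": np.concatenate([rng.poisson(4, 300), rng.poisson(6, 300)]),
    "readmitted": ["NO"] * 300 + ["YES"] * 300,
})

fig, ax = plt.subplots()
for label, grp in df.groupby("readmitted"):
    # density=True puts both classes on a comparable scale despite class sizes
    ax.hist(grp["time_in_hospital"], bins=15, alpha=0.5, label=label, density=True)
ax.set_xlabel("time_in_hospital")
ax.set_ylabel("density")
ax.legend(title="readmitted")
fig.savefig("time_in_hospital_by_readmission.png")
```

A visible shift between the two overlaid distributions is what marks a feature as informative.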

Figure 3. Distribution of Numeric Features and Readmission Status


We reviewed several of the categorical features in a similar way and found that metformin, insulin, and hemoglobin A1c levels were most informative for predicting readmission status. Testing hemoglobin A1c levels may have prompted changes to medications, and thus better management of diabetes during the hospital stay, preventing readmission (Strack et al. 2014b).

Figure 4. Distribution of Categorical Features and Readmission Status


After exploratory data analysis and data cleaning, machine learning models were tested with our data set, and logistic regression (balanced) was determined to be the best model to move forward with, based on fit and score time, accuracy, and F1 score, as shown in Table 1. The code used to perform the machine learning analysis can be found [here](INSERT_URL_TO_SCRIPT_4).

Table 1. Classifier Models and Their Associated Times and Scores
| fit_time | score_time | test_accuracy | train_accuracy | test_f1 | train_f1 | test_recall | train_recall | test_precision | train_precision | test_average_precision | train_average_precision | test_roc_auc | train_roc_auc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0311 | 0.0404 | 0.5525 | 0.5066 | 0.5041 | 0.4663 | 0.4853 | 0.4607 | 0.5251 | 0.4726 | 0.5251 | 0.4726 | 0.4757 | 0.5142 |
| 0.0751 | 0.0511 | 0.5963 | 0.7431 | 0.4962 | 0.6831 | 0.4240 | 0.5913 | 0.5987 | 0.8096 | 0.5987 | 0.8096 | 0.6394 | 0.8429 |
| 0.0981 | 0.0289 | 0.6050 | 0.8472 | 0.5620 | 0.8324 | 0.5413 | 0.8100 | 0.5856 | 0.8563 | 0.5856 | 0.8563 | 0.6284 | 0.9177 |
| 0.1010 | 0.0319 | 0.6100 | 0.8475 | 0.5870 | 0.8380 | 0.5920 | 0.8413 | 0.5834 | 0.8348 | 0.5834 | 0.8348 | 0.6278 | 0.9180 |

Hyperparameter optimization was performed and identified potential improvements for our model in future use. From the confusion matrix, we can see a large number of false positives and false negatives. The ROC curve, with the red point marking our 0.5 threshold, gives an AUC of 0.54, where an AUC of 1.0 corresponds to perfect classification. These results suggest that our model could be improved by choosing a different classifier or by optimizing with scoring metrics other than the F1 score. Our model performed only moderately on the test data, and will therefore require improvement before being applied to deployment data for use in clinical studies. Details about these improvements can be found in the Future Improvements section of this report.
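The confusion matrix and ROC AUC figures reported above can be computed as in the sketch below. The data are again a synthetic stand-in rather than the diabetes dataset, so the resulting numbers will not match the report's; only the computation pattern is the point.

```python
# Sketch: confusion matrix and ROC AUC for a balanced logistic regression,
# the evaluation described above, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes; the off-diagonal
# entries are the false negatives and false positives discussed above.
cm = confusion_matrix(y_te, model.predict(X_te))

# ROC AUC uses predicted probabilities for the positive class, not hard labels.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```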

Figure 5. Confusion Matrix and ROC AUC Curve Results From Logistic Regression (Balanced)

Completing our analysis using our model on the test data, the 10 features most indicative of patient readmission were identified as those with the largest associated coefficients, as shown in Table 2. These results suggest that the healthcare system could begin implementing changes targeting these specific features, particularly the number of prior inpatient visits (which has the largest coefficient), in order to manage diabetic patient readmission more effectively.

Table 2. Top 10 Features Most Indicative of Patient Readmission
| Features | Coefficients |
|---|---|
| admission_type_id | 0.2663170 |
| discharge_disposition_id | 0.2435044 |
| admission_source_id | 0.2473305 |
| time_in_hospital | 0.0084882 |
| num_lab_procedures | 0.0297474 |
| num_procedures | 0.0265883 |
| num_medications | 0.0958197 |
| number_outpatient | 0.2936842 |
| number_emergency | 0.1059684 |
| number_inpatient | 0.7724087 |
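A table like Table 2 can be produced by ranking features by the absolute value of the fitted logistic-regression coefficients. The feature names below come from Table 2, but the data are synthetic, so the coefficient values are illustrative only.

```python
# Sketch: extract and rank logistic-regression coefficients by magnitude,
# as used to build the feature-importance table above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = [
    "admission_type_id", "discharge_disposition_id", "admission_source_id",
    "time_in_hospital", "num_lab_procedures", "num_procedures",
    "num_medications", "number_outpatient", "number_emergency", "number_inpatient",
]

# Synthetic stand-in for the preprocessed feature matrix and target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, len(feature_names)))
y = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000).fit(X, y)

# model.coef_[0] holds one coefficient per feature for a binary problem;
# ranking by |coefficient| gives the most indicative features first.
coefs = (
    pd.DataFrame({"Features": feature_names, "Coefficients": model.coef_[0]})
      .assign(abs_coef=lambda d: d["Coefficients"].abs())
      .sort_values("abs_coef", ascending=False)
      .drop(columns="abs_coef")
)
```

Note that comparing raw coefficients this way assumes the features were scaled to a common range before fitting.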

Future Improvements

To improve our model in the future and further analyze patient readmission, we have three main suggestions. First, rather than subsetting our data to a representative random sample of 1,000 observations, we can take a larger subset or even use our entire data set of roughly 70,000 observations. Our reasoning for subsetting the data was to reduce the fit and score time of our models. In future models, using a larger data set will provide more accurate predictions, reduce bias, and identify outliers that may skew our results. Second, other classifier models such as random forests could be tested against our logistic regression model to improve predictions. Logistic regression assumes a linear relationship between the features and the log-odds of the outcome, which may not hold in our project; random forests are non-parametric and bypass this linearity assumption. Third, our model analyzes a binary classification of readmission, although the original data set had a multi-class readmission target (not readmitted, readmitted within 30 days, readmitted after more than 30 days). By using a multi-class model instead, we can generate more specific predictions and analyze how features relate to readmission time, making our conclusions more informative for improving the healthcare system.
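The third suggestion requires no change of algorithm: scikit-learn's `LogisticRegression` handles a multi-class target directly, so the original three readmission labels could be kept. A minimal sketch, on synthetic data with the dataset's actual label strings:

```python
# Sketch: fitting logistic regression on the original three-class readmitted
# target ("NO", "<30", ">30") instead of a collapsed binary target.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))                      # synthetic features
y = rng.choice(["NO", "<30", ">30"], size=300)     # three-class target

model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba now returns one probability per readmission class,
# allowing the severity/timing of readmission to be analyzed, not just
# whether readmission occurs.
probs = model.predict_proba(X[:1])
```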

References

Brugman, Simon. 2019. “pandas-profiling: Exploratory Data Analysis for Python.” https://github.com/pandas-profiling/pandas-profiling.

Chandra, Rakesh Vidya, and Bala Subrahmanyam Varanasi. 2015. Python Requests Essentials. Packt Publishing Ltd.

Harris, Charles R., K. Jarrod Millman, St’efan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, et al. 2020. “Array Programming with NumPy.” Nature 585 (7825): 357–62. https://doi.org/10.1038/s41586-020-2649-2.

Hunter, John D. 2007. “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering 9 (3): 90–95.

“Insulin, Medicines, & Other Diabetes Treatments.” 2016. National Institute of Diabetes and Digestive and Kidney Diseases. U.S. Department of Health and Human Services. https://www.niddk.nih.gov/health-information/diabetes/overview/insulin-medicines-treatments.

Kolhatkar, Varada. n.d. “DSCI 571: Supervised Learning 1.” https://github.ubc.ca/MDS-2020-21/DSCI_571_sup-learn-1_students.

McKinney, Wes, and others. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, 445:51–56. Austin, TX.

Ostblom, Joel. n.d. “DSCI 531: Data Visualization 1.” https://github.ubc.ca/MDS-2020-21/DSCI_531_viz-1_students.

Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (Oct): 2825–30.

Pérez, Fernando, and Brian E Granger. 2007. “IPython: A System for Interactive Scientific Computing.” Computing in Science & Engineering 9 (3).

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rossum, Guido Van, and Fred L. Drake Jr. n.d.a. “Core Tools for Working with Streams, Python/CPython.” https://github.com/python/cpython/blob/3.9/Lib/io.py.

———. n.d.b. “Secure Hashes and Message Digests, Python/CPython.” https://docs.python.org/3/library/hashlib.html.

———. n.d.c. “URL Handling Modules, Python/CPython.” https://github.com/python/cpython/tree/3.9/Lib/urllib/.

———. n.d.d. “Warning Control, Python/CPython.” https://docs.python.org/3/library/warnings.html.

———. n.d.e. “Work with Zip Archives, Python/CPython.” https://github.com/python/cpython/blob/3.9/Lib/zipfile.py.

Strack, Beata, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore. 2014a. “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml.

———. 2014b. “Impact of Hba1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records.” BioMed Research International. Hindawi. https://www.hindawi.com/journals/bmri/2014/781670/.

VanderPlas, Jacob, Brian Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. 2018. “Altair: Interactive Statistical Visualizations for Python.” Journal of Open Source Software 3 (32): 1057.

Van Rossum, Guido, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Xie, Yihui. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.

Altair GitHub Repository. 2019. https://github.com/altair-viz/altair/issues/1281.